Abstract

Various methods have been proposed for aligning 
texts in two or more languages such as the 
Canadian Parliamentary Debates (Hansards). Some 
of these methods generate a bilingual lexicon as a 
by-product. We present an alternative alignment
strategy, which we call K-vec, that starts by
estimating the lexicon. For example, it discovers
that the English word fisheries is similar to the 
French peches by noting that the distribution of 
fisheries in the English text is similar to the 
distribution of peches in the French. K-vec does 
not depend on sentence boundaries. 

1. Motivation 

There have been quite a number of recent papers on 
parallel text: Brown et al (1990, 1991, 1993), Chen 
(1993), Church (1993), Church et al (1993), Dagan 
et al (1993), Gale and Church (1991, 1993), 
Isabelle (1992), Kay and Rosenschein (1993), 
Klavans and Tzoukermann (1990), Kupiec (1993), 
Matsumoto (1991), Ogden and Gonzales (1993), 
Shemtov (1993), Simard et al (1992), Warwick- 
Armstrong and Russell (1990), Wu (to appear). 
Most of this work has been focused on European 
language pairs, especially English-French. It 
remains an open question how well these methods 
might generalize to other language pairs, especially 
pairs such as English-Japanese and English- 
Chinese. 

In previous work (Church et al, 1993), we have 
reported some preliminary success in aligning the 
English and Japanese versions of the AWK manual 
(Aho, Kernighan, Weinberger (1980)), using
char_align (Church, 1993), a method that looks for
character sequences that are the same in both the 
source and target. The char_align method was 
designed for European language pairs, where 
cognates often share character sequences, e.g., 
government and gouvernement. In general, this
approach doesn't work between languages such as
English and Japanese, which are written in different
alphabets. The AWK manual happens to contain a
large number of examples and technical words that
are the same in the English source and the Japanese
target.

It remains an open question how we might be able 
to align a broader class of texts, especially those 
that are written in different character sets and share 
relatively few character sequences. The K-vec 
method attempts to address this question. 

2. The K-vec Algorithm 

K-vec starts by estimating the lexicon. Consider
the example: fisheries → peches. The K-vec
algorithm will discover this fact by noting that the
distribution of fisheries in the English text is
similar to the distribution of peches in the French.

The concordances for fisheries and peches are
shown in Tables 1 and 2 (at the end of this paper).¹



1. These tables were computed from a small fragment of the
Canadian Hansards that has been used in a number of other 
studies: Church (1993) and Simard et al (1992). The 
English text has 165,160 words and the French text has 
185,615 words. 



There are 19 instances of fisheries and 21 instances 
of peches. The numbers along the left hand edge 
show where the concordances were found in the 
texts. We want to know whether the distribution of 
numbers in Table 1 is similar to those in Table 2, 
and if so, we will suspect that fisheries and peches 
are translations of one another. A quick look at the 
two tables suggests that the two distributions are 
probably very similar, though not quite identical.²

We use a simple representation of the distribution
of fisheries and peches. The English text and the
French text are each split into K pieces. Then we
determine whether or not the word in question
appears in each of the K pieces. Thus, we denote
the distribution of fisheries in the English text with
a K-dimensional binary vector, Vf, and similarly,
we denote the distribution of peches in the French
text with a K-dimensional binary vector, Vp. The
i-th bit of Vf indicates whether or not fisheries
occurs in the i-th piece of the English text, and
similarly, the i-th bit of Vp indicates whether or not
peches occurs in the i-th piece of the French text.

If we take K to be 10, the first three instances of
fisheries in Table 1 fall into piece 2, and the
remaining 16 fall into piece 8. Similarly, the first 4
instances of peches in Table 2 fall into piece 2, and
the remaining 17 fall into piece 8. Thus,

Vf = Vp = <0, 0, 1, 0, 0, 0, 0, 0, 1, 0>

Now, we want to know if Vf is similar to Vp, and if 
we find that it is, then we will suspect that fisheries
→ peches. In this example, of course, the vectors
are identical, so practically any reasonable 
similarity statistic ought to produce the desired 
result. 
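
The construction of the binary vectors can be sketched in Python as follows. This is only an illustration under stated assumptions, not the original implementation: the function name kvec, the 0-based piece numbering, and the rule for assigning an offset to a piece are our own choices, and the offsets in the example are made up rather than taken from Tables 1 and 2.

```python
def kvec(positions, text_length, k):
    """Return a K-dimensional binary vector for one word.

    positions   -- 0-based token offsets at which the word occurs
    text_length -- total number of tokens in the text
    k           -- number of equal-sized pieces the text is split into
    """
    vector = [0] * k
    for pos in positions:
        piece = pos * k // text_length  # 0-based index of the piece holding pos
        vector[piece] = 1
    return vector


# Hypothetical offsets (not data from the paper): a 1,000-token English text
# and a slightly longer 1,100-token French text.
english = kvec([210, 215, 260, 810, 815, 890], text_length=1000, k=10)
french = kvec([250, 262, 310, 322, 900, 915, 980], text_length=1100, k=10)
print(english)
print(french)
print(english == french)
```

Because each offset is scaled by the length of its own text, the two vectors can agree even though the texts differ in length and the word counts differ.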



3. fisheries is not the translation of lections 

Before describing how we estimate the similarity of 
Vf and Vp , let us see what would happen if we tried 
to compare fisheries with a completely unrelated 
word, e.g., lections. (This word should be the
translation of elections, not fisheries.) 



2. At most, fisheries can account for only 19 instances of 
peches, leaving at least 2 instances of peches unexplained. 



As can be seen in the concordances in Table 3, for
K=10, the vector is <1, 1, 0, 1, 1, 0, 1, 0, 0, 0>. By
almost any measure of similarity one could 
imagine, this vector will be found to be quite 
different from the one for fisheries, and therefore, 
we will correctly discover that fisheries is not the 
translation of lections. 

To make this argument a little more precise, it 
might help to compare the contingency matrices in 
Tables 5 and 6. The contingency matrices show: 
(a) the number of pieces where both the English 
and French word were found, (b) the number of 
pieces where just the English word was found, (c) 
the number of pieces where just the French word 
was found, and (d) the number of pieces where
neither word was found. 



Table 4: A contingency matrix

                French
    English     a    b
                c    d

Table 5: fisheries vs. peches

                peches
    fisheries   2    0
                0    8

Table 6: fisheries vs. lections

                lections
    fisheries   0    2
                4    4



In general, if the English and French words are 
good translations of one another, as in Table 5, then 
a should be large, and b and c should be small. In 
contrast, if the two words are not good translations 
of one another, as in Table 6, then a should be 
small, and b and c should be large. 
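
The four cells can be counted directly from a pair of binary vectors. The sketch below is an illustration only; the function name contingency is hypothetical, and the input is the fisheries/peches vector given in Section 2.

```python
def contingency(v_english, v_french):
    """Count the cells (a, b, c, d) of the contingency matrix.

    a -- pieces containing both words
    b -- pieces containing only the English word
    c -- pieces containing only the French word
    d -- pieces containing neither word
    """
    if len(v_english) != len(v_french):
        raise ValueError("both vectors must have the same number of pieces")
    a = b = c = d = 0
    for e, f in zip(v_english, v_french):
        if e and f:
            a += 1
        elif e:
            b += 1
        elif f:
            c += 1
        else:
            d += 1
    return a, b, c, d


v_fisheries = [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
v_peches = [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
print(contingency(v_fisheries, v_peches))  # (2, 0, 0, 8)
```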

4. Mutual Information 

Intuitively, these statements seem to be true, but we 
need to make them more precise. One could have 
chosen quite a number of similarity metrics for this 
purpose. We use mutual information: 



$$\log_2 \frac{\mathrm{prob}(V_f, V_p)}{\mathrm{prob}(V_f)\,\mathrm{prob}(V_p)}$$

That is, we want to compare the probability of
seeing fisheries and peches in the same piece to
chance. The probability of seeing the two words in
the same piece is simply:

$$\mathrm{prob}(V_f, V_p) = \frac{a}{a+b+c+d}$$

The marginal probabilities are:

$$\mathrm{prob}(V_f) = \frac{a+b}{a+b+c+d}$$

$$\mathrm{prob}(V_p) = \frac{a+c}{a+b+c+d}$$


For fisheries → peches, prob(Vf,Vp) = prob(Vf)
= prob(Vp) = 0.2. Thus, the mutual information is
log2 5, or 2.32 bits, meaning that the two words
share a piece 5 times more often than chance would
predict. In contrast, for fisheries → lections,
prob(Vf,Vp) = 0, prob(Vf) = 0.2 and prob(Vp) = 0.4.
Thus, the mutual information is log2 0, i.e., negative
infinity, meaning that the joint probability
is infinitely less likely than chance. We conclude
that it is quite likely that fisheries and peches are 
translations of one another, much more so than 
fisheries and lections. 
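
As a worked check of these numbers, the mutual information can be computed from the four cells of a contingency matrix. The sketch below is illustrative; the function name is hypothetical, and the cell values assume that Table 5 reads a=2, b=0, c=0, d=8 and Table 6 reads a=0, b=2, c=4, d=4.

```python
import math


def mutual_information(a, b, c, d):
    """Mutual information, in bits, from the contingency cells a, b, c, d."""
    k = a + b + c + d
    joint = a / k          # both words in the same piece
    p_f = (a + b) / k      # English word present
    p_p = (a + c) / k      # French word present
    if joint == 0:
        # log2 of zero: the pair never shares a piece
        return float("-inf")
    return math.log2(joint / (p_f * p_p))


print(round(mutual_information(2, 0, 0, 8), 2))  # fisheries vs. peches
print(mutual_information(0, 2, 4, 4))            # fisheries vs. lections
```

The first call gives log2 5, about 2.32 bits; the second gives negative infinity because the joint count is zero.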

5. Significance 

Unfortunately, mutual information is often
unreliable when the counts are small. For example,
there are lots of infrequent words. If we pick a pair
of these words at random, there is a very large
chance that they would receive a large mutual
information value by accident. For example, let e be
an English word that appeared just once and let f be
a French word that appeared just once. Then, there
is a non-trivial chance (1/K) that e and f will appear
in the same piece, as shown in Table 7. If this
should happen, the mutual information estimate
would be very large, i.e., log2 K, and probably
misleading.



Table 7:

                f
    e           1    0
                0    9

In order to avoid this problem, we use a t-score to
filter out insignificant mutual information values.

$$t \approx \frac{\mathrm{prob}(V_f, V_p) - \mathrm{prob}(V_f)\,\mathrm{prob}(V_p)}{\sqrt{\frac{1}{K}\,\mathrm{prob}(V_f, V_p)}}$$

Using the numbers in Table 7, t ≈ 1, which is not
significant. (A t of 1.65 or more would be
significant at the p > 0.95 confidence level.)

Similarly, if e and f appeared in just two pieces
each, then there is approximately a 2/K^2 chance that
they would both appear in the same two pieces, and
then the mutual information score would be quite
high, log2 (K/2), but we probably wouldn't believe it
because the t-score would be only √2. By this
definition of significance, we need to see the two
words in at least 3 different pieces before the result
would be considered significant.

This means, unfortunately, that we would reject
fisheries → peches because we found them in only
two pieces. The problem, of course, is that we
don't have enough pieces. When K=10, there
simply isn't enough resolution to see what's going
on. At K=100, we obtain the contingency matrix
shown in Table 8, and the t-score is significant
(t = 2.1).

Table 8: K=100

                peches
    fisheries   5    1
                0    94

How do we choose K? As we have seen, if we 
choose too small a K, then the mutual information 
values will be unreliable. However, we can only 
increase K up to a point. If we set K to a 
ridiculously large value, say the size of the English 
text, then an English word and its translations are 
likely to fall in slightly different pieces due to 
random fluctuations and we would miss the signal. 
For this work, we set K to the square root of the 
size of the corpus. 
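
The effect of K on significance can be checked numerically. The sketch below is an illustration, not the original code. It assumes the t-score is the difference between the joint probability and the product of the marginals, divided by the square root of prob(Vf,Vp)/K, and it assumes Table 7 reads a=1, b=0, c=0, d=9 and Table 8 reads a=5, b=1, c=0, d=94. The function name and the T_MIN constant are our own.

```python
import math

T_MIN = 1.65  # significance threshold quoted in the text


def t_score(a, b, c, d):
    """t-score of a contingency matrix with cells a, b, c, d."""
    k = a + b + c + d
    joint = a / k
    if joint == 0:
        # The words never share a piece, so the pair cannot be significant.
        return float("-inf")
    p_f = (a + b) / k
    p_p = (a + c) / k
    return (joint - p_f * p_p) / math.sqrt(joint / k)


cases = [
    ("one shared piece, K=10 (Table 7)", (1, 0, 0, 9)),
    ("two shared pieces, K=10 (Table 5)", (2, 0, 0, 8)),
    ("fisheries vs. peches, K=100 (Table 8)", (5, 1, 0, 94)),
]
for label, cells in cases:
    t = t_score(*cells)
    print(f"{label}: t = {t:.2f}, significant = {t >= T_MIN}")
```

Under these assumptions the K=10 cases stay below the threshold, while the K=100 matrix gives a t-score of about 2.1, in line with the value reported for Table 8.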

K should be thought of as a scale parameter. If we 
use too low a resolution, then everything turns into 
a blur and it is hard to see anything. But if we use 
too high a resolution, then we can miss the signal if
it isn't just exactly where we are looking.

Ideally, we would like to apply the K-vec algorithm 
to all pairs of English and French words, but 
unfortunately, there are too many such pairs to 
consider. We therefore limited the search to pairs 
of words in the frequency range: 3-10. This 
heuristic makes the search practical, and catches 
many interesting pairs.³
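
Putting the pieces together, the restricted search might look like the sketch below. It is a hedged illustration rather than the original program: the names bits, score and candidate_pairs, the dictionary layout (word mapped to token frequency and K-vec), and the toy vocabulary are all assumptions made for the example, and the t-score formula is the one assumed from Section 5.

```python
import math
from itertools import product


def bits(k, ones):
    """Build a K-dimensional binary vector with 1s at the given piece indices."""
    return [1 if i in ones else 0 for i in range(k)]


def score(v_e, v_f):
    """Return (mutual information, t-score) for two equal-length binary vectors."""
    k = len(v_e)
    joint = sum(1 for e, f in zip(v_e, v_f) if e and f) / k
    if joint == 0:
        return float("-inf"), float("-inf")  # never in the same piece
    p_e, p_f = sum(v_e) / k, sum(v_f) / k
    mi = math.log2(joint / (p_e * p_f))
    t = (joint - p_e * p_f) / math.sqrt(joint / k)
    return mi, t


def candidate_pairs(english, french, lo=3, hi=10, t_min=1.65):
    """Score all pairs of words whose frequency lies in [lo, hi].

    english and french map each word to (token frequency, K-vec).
    Returns the significant pairs, largest mutual information first.
    """
    e_words = {w: v for w, (n, v) in english.items() if lo <= n <= hi}
    f_words = {w: v for w, (n, v) in french.items() if lo <= n <= hi}
    pairs = []
    for (we, ve), (wf, vf) in product(e_words.items(), f_words.items()):
        mi, t = score(ve, vf)
        if t >= t_min:
            pairs.append((mi, we, wf))
    return sorted(pairs, key=lambda p: p[0], reverse=True)


# Toy vocabulary with made-up distributions (K = 100 pieces).
K = 100
english = {
    "farmers": (4, bits(K, {2, 5, 9, 14})),
    "inflation": (3, bits(K, {1, 7, 16})),
    "the": (250, bits(K, set(range(K)))),
}
french = {
    "agriculteurs": (4, bits(K, {2, 5, 9, 14})),
    "inflation": (3, bits(K, {1, 7, 16})),
    "de": (300, bits(K, set(range(K)))),
}
for mi, we, wf in candidate_pairs(english, french):
    print(f"{mi:.2f}  {wf} -> {we}")
```

The frequency filter removes the very common words before any pair is scored, which is what keeps the number of comparisons manageable.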

6. Results 

This algorithm was applied to a fragment of the 
Canadian Hansards that has been used in a number 
of other studies: Church (1993) and Simard et al 
(1992). The 30 significant pairs with the largest 
mutual information values are shown in Table 9. 
As can be seen, the results provide a quick-and-
dirty estimate of a bilingual lexicon. When the pair
is not a direct translation, it is often the translation
of a collocate, as illustrated by acheteur → Limited
and Sante → Welfare. (Note that some words in
Table 9 are spelled the same way in English and
French; this information is not used by the K-vec
algorithm.)

Using a scatter plot technique developed by Church
and Helfman (1993) called dotplot, we can visualize
the alignment, as illustrated in Figure 1. The
source text (Nx bytes) is concatenated to the target
text (Ny bytes) to form a single input sequence of
Nx+Ny bytes. A dot is placed in position i,j
whenever the input token at position i is the same
as the input token at position j.

The equality constraint is relaxed in Figure 2. A
dot is placed in position i,j whenever the input
token at position i is highly associated with the
input token at position j as determined by the
mutual information score of their respective K-
vecs. In addition, Figure 2 shows a detailed, magnified
and rotated view of the diagonal line. The 
alignment program tracks this line with as much 
precision as possible. 
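
The two dot-placement rules can be illustrated with a small quadratic-time sketch. It is a toy, not the dotplot program of Church and Helfman (1993). The function name dotplot, the related predicate, and the two-entry lexicon are hypothetical, and a real implementation would need to avoid comparing every pair of positions.

```python
def dotplot(tokens, related=None):
    """Return the set of (i, j) positions that receive a dot.

    With no predicate, a dot marks identical tokens (the equality rule).
    A custom predicate relaxes equality to any notion of association.
    """
    if related is None:
        related = lambda s, t: s == t
    return {
        (i, j)
        for i, s in enumerate(tokens)
        for j, t in enumerate(tokens)
        if related(s, t)
    }


# Source text concatenated to target text, as described above.
source = ["fisheries", "and", "oceans"]
target = ["peches", "et", "oceans"]
tokens = source + target

exact = dotplot(tokens)

# Hypothetical lexicon standing in for pairs with high K-vec association.
lexicon = {("fisheries", "peches"), ("and", "et")}


def associated(s, t):
    return s == t or (s, t) in lexicon or (t, s) in lexicon


relaxed = dotplot(tokens, associated)
print(sorted(exact - {(i, i) for i in range(len(tokens))}))
print(sorted(relaxed - exact))
```

With equality only, the off-diagonal dots come from the shared token oceans; the relaxed rule adds dots for the associated pairs, which is what makes the alignment line visible when the two texts share few identical tokens.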



Table 9: K-vec results

         French            English
  3.2    Beauce            Beauce
  3.2    Comeau            Comeau
  3.2    1981              1981
  3.0    Richmond          Richmond
  3.0    Rail              VIA
  3.0    peches            Fisheries
  2.8    Deans             Deans
  2.8    Prud              Prud
  2.8    Prud              homme
  2.7    acheteur          Limited
  2.7    Communications    Communications
  2.7    MacDonald         MacDonald
  2.6    Mazankowski       Mazankowski
  2.5    croisiere         nuclear
  2.5    Sante             Welfare
  2.5    39                39
  2.5    Johnston          Johnston
  2.5    essais            nuclear
  2.5    Universite        University
  2.5    bois              lumber
  2.5    Angus             Angus
  2.4    Angus             VIA
  2.4    Saskatoon         University
  2.4    agriculteurs      farmers
  2.4    inflation         inflation
  2.4    James             James
  2.4    Vanier            Vanier
  2.4    Sante             Health
  2.3    royale            languages
  2.3    grief             grievance


7. Conclusions 





3. The low frequency words (frequency less than 3) would
have been rejected anyway as insignificant.



The K-vec algorithm generates a quick-and-dirty 
estimate of a bilingual lexicon. This estimate could 
be used as a starting point for a more detailed 
alignment algorithm such as word_align (Dagan et 
al, 1993). In this way, we might be able to apply 
word_align to a broader class of language 
combinations including possibly English-Japanese 
and English-Chinese. Currently, word_align 
depends on char_align (Church, 1993) to generate 
a starting point, which limits its applicability to 
European languages since char_align was designed 
for language pairs that share a common alphabet. 
Table 1: Concordances for fisheries

28312
28388 
28440 
128630 
128885 
128907 
130887 
132282 
132629 
132996 
134026 
134186 
134394
134785 
134796 
134834 
134876 



Mr . Speaker , my question is for the Minister of Fisheries and Oceans . Allegations have been made
of the stocks ? Hon . Thomas Siddon ( Minister of Fisheries and Oceans ): Mr . Speaker , I tell the
calculation on which the provincial Department of Fisheries makes this allegation and I find that it
private sector is quite weak . Let us turn now to fisheries , an industry which as most important fo
The fishermen would like to see the Department of Fisheries and Oceans put more effort towards the p
s in particular . The budget of the Department of Fisheries and Oceans has been reduced to such a le
 ' habitation ' ' from which to base his trade in fisheries and furs . He brought with him the first
ase just outside of my riding . The Department of Fisheries and Oceans provides employment for many
 and all indications are that the richness of its fisheries resource will enable it to maintain its
 taxpayer . The role of the federal Department of Fisheries and Oceans is central to the concerns of
 is the new Chairman of the Standing Committee on Fisheries and Oceans . I am sure he will bring a w
ortunity to discuss it with me as a member of the Fisheries Committee . The Hon . Member asked what
he proposal has been submitted to the Minister of Fisheries and Oceans ( Mr . Siddon ) which I hope
ch as well as on his selection as Chairman of the Fisheries Committee . I have worked with Mr . Come
his intense interest and expertise in the area of fisheries . It seems most appropriate , given that
r from Eastern Canada and the new Chairman of the Fisheries and Oceans Committee . We know that the
d Oceans Committee . We know that the Minister of Fisheries and Oceans ( Mr . Siddon ) , should we s
ows the importance of research and development to fisheries and oceans . Is he now ready to tell the
research and development component in the area of fisheries and oceans at Bedford , in order that th



Table 2: Concordances for peches 



 31547  oyez certain que je presenterai mes excuses . Les peches L ' existence possible d ' un marche noir e
 31590  esident , ma question s ' adresse au ministre des Peches et des Oceans . On aurait peche , debarque
 31671  poissons ? L ' hon . Thomas Siddon ( ministre des Peches et des Oceans )
 31728   calculs sur lesquels le ministere provincial des peches fonde ses allegations , et j ' y ai releve
144855  ive est beaucoup plus faible . Parlons un peu des peches , un secteur tres important pour l ' Atlant
145100   braconnage . Ils voudraient que le ministere des Peches et des Oceans fasse davantage , particulier
145121  es stocks de homards . Le budget du ministere des Peches et des Oceans a te ampute de telle sorte qu
148873  endant l ' hiver , lorsque l ' agriculture et les peches sont peu pres leur point mort , bon nombre
149085  xterieur de ma circonscription . Le ministere des Peches et des Oceans assure de l ' emploi bien d '
149837  s . Dans le rapport Kirby de 1983 portant sur les peches de la cote est , on a mal explique le syste
149960  eniers publics . Le role du ministere federal des Peches et des Oceans se trouve au centre des preoc
151108  soit le nouveau president du comite permanent des peches et oceans . Je suis sur que ses vastes conn
151292   avec moi , en ma qualite de membre du comite des peches et oceans . Le depute a demande quelles per
151398  is savoir qu ' elle a te proposee au ministre des Peches et Oceans ( M . Siddon ) et j ' espere qu '
151498   de son choix au poste de president du comite des peches . Je travaille avec M . Comeau depuis deux
151521   et je connais tout l ' interet qu ' il porte aux peches , ainsi que sa competence cet gard . Cela s
151936  Est du pays et maintenant president du Comite des peches et des oceans . On sait que le ministre des
151947  eches et des oceans . On sait que le ministre des Peches et des Oceans ( M . Siddon ) a , disons , a
151997  recherche et du developpement dans le domaine des peches et des oceans . Est - il pret aujourd ' hui
152049  recherche et du developpement dans le domaine des peches et des oceans Bedford afin que ce laboratoi
152168  s endroits ou'g ils se trouvent et l ' avenir des peches dans l ' Est . Le president suppleant ( M .



Table 3: Concordances for lections 



207 
12439 
14999 
16164 
16386 
16389 
16431 
17419 
17427 
17438 
17461 
55169 
56641 
57853 
59027 
67980 
70161 
70456 
103132 
103186 



de prendre la parole aujourd ' hui . Bien que les lections au cours desquelles on nous a lus la tete
ui servent ensemble la Chambre des communes . Les lections qui se sont tenues au debut de la deuxiem
n place les mesures de controle suffisantes . Les lections approchaient et les liberaux voulaient me
reprendre le contenu de son discours lectoral des lections de 1984 . On se rappelle , et tous les Ca
ertainement et s ' en rappelleront aux prochaines lections de tout ce qui aurait pu leur arriver . L
n apercevront encore une fois lors des prochaines lections . Des lections , monsieur le President ,
ncore une fois lors des prochaines lections . Des lections , monsieur le President , il y en a eu de
avec eux - memes l ' analyse des resultats de ces lections complementaires , constateront qu ' ils o
s et ils reagissent . Ils ont reagi aux dernieres lections complementaires et ils reagiront encore a
ementaires et ils reagiront encore aux prochaines lections . Finalement , monsieur le President , pa
t , monsieur le President , parlant de prochaines lections . . . j ' coutais mon honorable collegue
M . Layton ) dire tantot que , anterieurement aux lections de 1984 , les gens de Lachine voulaient u
etitions . Je suggererais au Comite permanent des lections , des privileges et de la procedure d ' t
ulever cette question au comite des privileges et lections , car il y a de serieux doutes sur l ' in
doivent tre renvoyees au comite des privileges et lections . J ' ai l ' intention d ' en saisir ce c
ret soumettre la question au comite permanent des lections , des privileges et de la procedure . J '
  le 16 Janvier 1986 . . . M . Hovdebo: Apres les lections . M . James: . . . le ministre d ' alors
tinuer faire ce qu ' ils ont fait depuis quelques lections , c ' est - - dire rejeter le Nouveau par
que les gens le retiennent jusqu ' aux prochaines lections . De cette façon vous allez tre rejetes d
donc transmis mon mandat au directeur general des lections , afin de l ' autoriser mettre un nouveau
, deux deputes ont avise le directeur general des lections d ' une vacance survenue la Chambre ; il



